Assignment 1

Author

Norma Marshall

Step 1

Given the formulated question from the assignment description, you will now conduct EDA Checklist items 2-4. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website. Read in the data using data.table(). For each of the two datasets, check the dimensions, headers, footers, variable names and variable types. Check for any data issues, particularly in the key variable we are analyzing. Make sure you write up a summary of all of your findings.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.1     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(data.table)

Attaching package: 'data.table'

The following objects are masked from 'package:lubridate':

    hour, isoweek, mday, minute, month, quarter, second, wday, week,
    yday, year

The following object is masked from 'package:purrr':

    transpose

The following objects are masked from 'package:dplyr':

    between, first, last
library(leaflet)
library(ggplot2)
library(lubridate)
epa02 <- read.csv("EPA2002.csv")
dim(epa02)
[1] 15976    22
str(epa02)
'data.frame':   15976 obs. of  22 variables:
 $ Date                          : chr  "01/05/2002" "01/06/2002" "01/08/2002" "01/11/2002" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site.ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Daily.Mean.PM2.5.Concentration: num  25.1 31.6 21.4 25.9 34.5 41 29.3 15 18.8 37.9 ...
 $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ Daily.AQI.Value               : int  81 93 74 82 98 115 89 62 69 107 ...
 $ Local.Site.Name               : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ Daily.Obs.Count               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Percent.Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS.Parameter.Code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS.Parameter.Description     : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ Method.Code                   : int  120 120 120 120 120 120 120 120 120 120 ...
 $ Method.Description            : chr  "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" ...
 $ CBSA.Code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA.Name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ State.FIPS.Code               : int  6 6 6 6 6 6 6 6 6 6 ...
 $ State                         : chr  "California" "California" "California" "California" ...
 $ County.FIPS.Code              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ County                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ Site.Latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ Site.Longitude                : num  -122 -122 -122 -122 -122 ...
summary(epa02)
     Date              Source             Site.ID              POC       
 Length:15976       Length:15976       Min.   :60010007   Min.   :1.000  
 Class :character   Class :character   1st Qu.:60290014   1st Qu.:1.000  
 Mode  :character   Mode  :character   Median :60590007   Median :1.000  
                                       Mean   :60549600   Mean   :1.581  
                                       3rd Qu.:60731002   3rd Qu.:1.000  
                                       Max.   :61131003   Max.   :6.000  
                                                                         
 Daily.Mean.PM2.5.Concentration    Units           Daily.AQI.Value 
 Min.   :  0.00                 Length:15976       Min.   :  0.00  
 1st Qu.:  7.00                 Class :character   1st Qu.: 39.00  
 Median : 12.00                 Mode  :character   Median : 56.00  
 Mean   : 16.12                                    Mean   : 59.28  
 3rd Qu.: 20.50                                    3rd Qu.: 72.00  
 Max.   :104.30                                    Max.   :185.00  
                                                                   
 Local.Site.Name    Daily.Obs.Count Percent.Complete AQS.Parameter.Code
 Length:15976       Min.   :1       Min.   :100      Min.   :88101     
 Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
 Mode  :character   Median :1       Median :100      Median :88101     
                    Mean   :1       Mean   :100      Mean   :88215     
                    3rd Qu.:1       3rd Qu.:100      3rd Qu.:88502     
                    Max.   :1       Max.   :100      Max.   :88502     
                                                                       
 AQS.Parameter.Description  Method.Code  Method.Description   CBSA.Code    
 Length:15976              Min.   :117   Length:15976       Min.   :12540  
 Class :character          1st Qu.:120   Class :character   1st Qu.:23420  
 Mode  :character          Median :120   Mode  :character   Median :40140  
                           Mean   :297                      Mean   :33270  
                           3rd Qu.:707                      3rd Qu.:41740  
                           Max.   :810                      Max.   :49700  
                                                            NA's   :929    
  CBSA.Name         State.FIPS.Code    State           County.FIPS.Code
 Length:15976       Min.   :6       Length:15976       Min.   :  1.00  
 Class :character   1st Qu.:6       Class :character   1st Qu.: 29.00  
 Mode  :character   Median :6       Mode  :character   Median : 59.00  
                    Mean   :6                          Mean   : 54.78  
                    3rd Qu.:6                          3rd Qu.: 73.00  
                    Max.   :6                          Max.   :113.00  
                                                                       
    County          Site.Latitude   Site.Longitude  
 Length:15976       Min.   :32.63   Min.   :-124.2  
 Class :character   1st Qu.:34.07   1st Qu.:-121.4  
 Mode  :character   Median :35.36   Median :-119.1  
                    Mean   :36.00   Mean   :-119.4  
                    3rd Qu.:37.77   3rd Qu.:-117.9  
                    Max.   :41.71   Max.   :-115.5  
                                                    
epa22 <- read.csv("EPA2022.csv")
dim(epa22)
[1] 59756    22
str(epa22)
'data.frame':   59756 obs. of  22 variables:
 $ Date                          : chr  "01/01/2022" "01/02/2022" "01/03/2022" "01/04/2022" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site.ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  3 3 3 3 3 3 3 3 3 3 ...
 $ Daily.Mean.PM2.5.Concentration: num  12.7 13.9 7.1 3.7 4.2 3.8 2.3 6.9 13.6 11.2 ...
 $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ Daily.AQI.Value               : int  58 60 39 21 23 21 13 38 59 55 ...
 $ Local.Site.Name               : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ Daily.Obs.Count               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Percent.Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS.Parameter.Code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS.Parameter.Description     : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ Method.Code                   : int  170 170 170 170 170 170 170 170 170 170 ...
 $ Method.Description            : chr  "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" ...
 $ CBSA.Code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA.Name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ State.FIPS.Code               : int  6 6 6 6 6 6 6 6 6 6 ...
 $ State                         : chr  "California" "California" "California" "California" ...
 $ County.FIPS.Code              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ County                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ Site.Latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ Site.Longitude                : num  -122 -122 -122 -122 -122 ...
summary(epa22)
     Date              Source             Site.ID              POC       
 Length:59756       Length:59756       Min.   :60010007   Min.   : 1.00  
 Class :character   Class :character   1st Qu.:60290019   1st Qu.: 1.00  
 Mode  :character   Mode  :character   Median :60631006   Median : 3.00  
                                       Mean   :60563315   Mean   : 3.77  
                                       3rd Qu.:60731026   3rd Qu.: 3.00  
                                       Max.   :61131003   Max.   :24.00  
                                                                         
 Daily.Mean.PM2.5.Concentration    Units           Daily.AQI.Value 
 Min.   : -6.700                Length:59756       Min.   :  0.00  
 1st Qu.:  4.100                Class :character   1st Qu.: 23.00  
 Median :  6.800                Mode  :character   Median : 38.00  
 Mean   :  8.429                                   Mean   : 39.28  
 3rd Qu.: 10.700                                   3rd Qu.: 54.00  
 Max.   :302.500                                   Max.   :454.00  
                                                                   
 Local.Site.Name    Daily.Obs.Count Percent.Complete AQS.Parameter.Code
 Length:59756       Min.   :1       Min.   :100      Min.   :88101     
 Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
 Mode  :character   Median :1       Median :100      Median :88101     
                    Mean   :1       Mean   :100      Mean   :88192     
                    3rd Qu.:1       3rd Qu.:100      3rd Qu.:88101     
                    Max.   :1       Max.   :100      Max.   :88502     
                                                                       
 AQS.Parameter.Description  Method.Code  Method.Description   CBSA.Code    
 Length:59756              Min.   :143   Length:59756       Min.   :12540  
 Class :character          1st Qu.:170   Class :character   1st Qu.:31080  
 Mode  :character          Median :170   Mode  :character   Median :40140  
                           Mean   :336                      Mean   :34957  
                           3rd Qu.:707                      3rd Qu.:41860  
                           Max.   :810                      Max.   :49700  
                                                            NA's   :4567   
  CBSA.Name         State.FIPS.Code    State           County.FIPS.Code
 Length:59756       Min.   :6       Length:59756       Min.   :  1.00  
 Class :character   1st Qu.:6       Class :character   1st Qu.: 29.00  
 Mode  :character   Median :6       Mode  :character   Median : 63.00  
                    Mean   :6                          Mean   : 56.19  
                    3rd Qu.:6                          3rd Qu.: 73.00  
                    Max.   :6                          Max.   :113.00  
                                                                       
    County          Site.Latitude   Site.Longitude  
 Length:59756       Min.   :32.58   Min.   :-124.2  
 Class :character   1st Qu.:34.07   1st Qu.:-121.4  
 Mode  :character   Median :36.49   Median :-119.6  
                    Mean   :36.24   Mean   :-119.6  
                    3rd Qu.:37.96   3rd Qu.:-117.9  
                    Max.   :41.76   Max.   :-115.5  
                                                    

Daily PM2.5

summary(epa02$Daily.Mean.PM2.5.Concentration)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    7.00   12.00   16.12   20.50  104.30 
summary(epa22$Daily.Mean.PM2.5.Concentration)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -6.700   4.100   6.800   8.429  10.700 302.500 
sum(is.na(epa02$Daily.Mean.PM2.5.Concentration))
[1] 0
sum(is.na(epa22$Daily.Mean.PM2.5.Concentration))
[1] 0

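There are a total of 22 variables in each year’s EPA file, with 15976 observations in 2002 and 59756 in 2022. Neither year has missing values in the key variable, the daily mean PM2.5 concentration, although CBSA.Code has 929 NA's in 2002 and 4567 in 2022. However, the 2022 minimum of -6.7 ug/m^3 suggests potential issues in the data, since a concentration cannot be negative.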
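The header and footer checks the prompt asks for, together with a count of negative readings, could be sketched as below. The data frame `toy` is a made-up stand-in for one year's EPA file, so its values are illustrative only and are not taken from the real data.

```r
# Hypothetical stand-in for one year's EPA file; all values are invented.
toy <- data.frame(
  Date = c("01/01/2022", "01/02/2022", "01/03/2022", "01/04/2022"),
  Daily.Mean.PM2.5.Concentration = c(12.7, -1.2, 7.1, 3.7)
)

# Inspect the first and last rows (headers and footers).
head(toy, 2)
tail(toy, 2)

# Count missing and negative values in the key variable.
n_missing  <- sum(is.na(toy$Daily.Mean.PM2.5.Concentration))
n_negative <- sum(toy$Daily.Mean.PM2.5.Concentration < 0)
n_missing
n_negative
```

The same three calls can be run on `epa02` and `epa22` directly.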

Step 2

Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.

epa_merge = merge(x = epa02,y = epa22, all=TRUE)
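When `merge()` is called without a `by` argument it joins on every shared column, so a row that is identical in both inputs collapses into a single row. That does not change the row count here (15976 + 59756 = 75732), but stacking with `rbind()` states the intent more directly. A minimal sketch of the difference, using two invented data frames `x` and `y`:

```r
# Hypothetical inputs that share one identical row (id = 2, value = "q").
x <- data.frame(id = c(1, 2), value = c("p", "q"))
y <- data.frame(id = c(2, 3), value = c("q", "r"))

# Full outer join on all shared columns: the identical row appears once.
joined <- merge(x, y, all = TRUE)

# Row-wise stacking: every input row is kept.
stacked <- rbind(x, y)

nrow(joined)
nrow(stacked)
```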

Date is transformed to a date format and the year variable is created in a numeric format.

epa_merge$Date <- as.Date(epa_merge$Date,"%m/%d/%Y") 
epa_merge$Year <- as.numeric(format(epa_merge$Date,'%Y')) 
str(epa_merge)
'data.frame':   75732 obs. of  23 variables:
 $ Date                          : Date, format: "2002-01-01" "2002-01-01" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site.ID                       : int  60074001 60130002 60290014 60290014 60290014 60370002 60371103 60374002 60590007 60658001 ...
 $ POC                           : int  3 1 1 3 4 1 1 1 1 1 ...
 $ Daily.Mean.PM2.5.Concentration: num  10.6 20.9 26.1 30.3 31.1 32.3 39.6 47.1 47.9 66.3 ...
 $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ Daily.AQI.Value               : int  54 73 83 90 92 94 111 130 132 159 ...
 $ Local.Site.Name               : chr  "TRAFFIC, RURAL PAVED ROAD" "Concord" "Bakersfield-California" "Bakersfield-California" ...
 $ Daily.Obs.Count               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Percent.Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS.Parameter.Code            : int  88502 88101 88101 88502 88502 88101 88101 88101 88101 88101 ...
 $ AQS.Parameter.Description     : chr  "Acceptable PM2.5 AQI & Speciation Mass" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "Acceptable PM2.5 AQI & Speciation Mass" ...
 $ Method.Code                   : int  731 120 120 731 731 120 120 120 120 120 ...
 $ Method.Description            : chr  "Met-One BAM-1020 W/PM2.5 SCC" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Met-One BAM-1020 W/PM2.5 SCC" ...
 $ CBSA.Code                     : int  17020 41860 12540 12540 12540 31080 31080 31080 31080 40140 ...
 $ CBSA.Name                     : chr  "Chico, CA" "San Francisco-Oakland-Hayward, CA" "Bakersfield, CA" "Bakersfield, CA" ...
 $ State.FIPS.Code               : int  6 6 6 6 6 6 6 6 6 6 ...
 $ State                         : chr  "California" "California" "California" "California" ...
 $ County.FIPS.Code              : int  7 13 29 29 29 37 37 37 59 65 ...
 $ County                        : chr  "Butte" "Contra Costa" "Kern" "Kern" ...
 $ Site.Latitude                 : num  39.3 37.9 35.4 35.4 35.4 ...
 $ Site.Longitude                : num  -122 -122 -119 -119 -119 ...
 $ Year                          : num  2002 2002 2002 2002 2002 ...

The key variables are also renamed to shorter names.

epa_merge <- 
  rename(epa_merge, 
         dailyPM2.5 = Daily.Mean.PM2.5.Concentration,
         dailyAQI = Daily.AQI.Value,
         lat = Site.Latitude,
         long = Site.Longitude)

Step 3

Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.

temp.pal <- colorFactor(c('skyblue','slateblue'), domain = epa_merge$Year) # Palette creation
  
leaflet(epa_merge) %>% 
  addProviderTiles('CartoDB.Positron') %>%
  addCircles(
    lat = ~lat, lng = ~long, color = ~temp.pal(Year),
    opacity = 1, fillOpacity = 1, radius = 100) %>%
  addLegend('bottomleft', pal=temp.pal, values=epa_merge$Year,
            title='Year', opacity=1)

There are more stations in 2022 than in 2002. The new sites are also in more densely populated areas, such as Los Angeles and San Francisco. Densely populated areas are, on average, more polluted because of transportation and reliance on cars. Monitoring PM2.5 concentrations in these areas may be useful for understanding patterns and developing policies that affect a significant portion of Californians.
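The claim that there are more stations in 2022 can be checked by counting distinct site IDs per year. The sketch below does this on an invented data frame `toy` that reuses the `Site.ID` and `Year` column names; the IDs are hypothetical.

```r
# Hypothetical site/year records; several rows per site mimic daily readings.
toy <- data.frame(
  Site.ID = c(101, 101, 102, 101, 102, 103, 103),
  Year    = c(2002, 2002, 2002, 2022, 2022, 2022, 2022)
)

# Number of distinct monitoring sites in each year.
sites_per_year <- tapply(toy$Site.ID, toy$Year,
                         function(ids) length(unique(ids)))
sites_per_year
```

Replacing `toy` with `epa_merge` would give the actual site counts.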

Step 4

Check for any missing or implausible values of PM in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.

sum(is.na(epa_merge$dailyPM2.5))
[1] 0
setorder(epa_merge, dailyPM2.5)
epa_merge %>%
  select(Date, dailyPM2.5, dailyAQI, Local.Site.Name)
#data frame hidden because it's too large
setorder(epa_merge, -dailyPM2.5)
epa_merge %>%
  select(Date, dailyPM2.5, dailyAQI, Local.Site.Name)
#data frame hidden because it's too large

In terms of location, there does not appear to be a pattern for the dates with the best air quality. The lowest daily PM2.5 values were in January of both 2002 and 2022. The highest PM2.5 values were recorded between the end of July and mid-September of 2022 (summer 2022). This may be due to the wildfire season, which put a lot of particulate matter and pollution in the air during this time. The highest PM2.5 concentration, 302.5 ug/m^3, was recorded on July 31, 2022 in Yreka. This aligns with the McKinney Fire, which burned in late July of 2022.
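The prompt also asks for the proportion of implausible values and their timing. One way to get both is to flag negative readings, then take their share per year and tabulate them by month. This is a sketch on an invented data frame `toy`; the dates and concentrations are hypothetical.

```r
# Hypothetical daily readings; two of the 2022 values are negative.
toy <- data.frame(
  Date = as.Date(c("2002-01-05", "2002-07-10", "2022-01-03",
                   "2022-01-20", "2022-07-31", "2022-08-02")),
  dailyPM2.5 = c(25.1, 12.0, -1.5, 4.2, 80.0, -0.3)
)
toy$Year  <- as.numeric(format(toy$Date, "%Y"))
toy$Month <- as.numeric(format(toy$Date, "%m"))

# Flag implausible (negative) concentrations.
toy$negative <- toy$dailyPM2.5 < 0

# Share of negative readings within each year.
prop_by_year <- tapply(toy$negative, toy$Year, mean)

# Months in which the negative readings occur.
neg_by_month <- table(toy$Month[toy$negative])

prop_by_year
neg_by_month
```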

Step 5

Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.

state

epa_merge$Year1 <- as.factor(epa_merge$Year)
library(ggplot2)
epa_merge$Year1 <- relevel(epa_merge$Year1,'2022')
ggplot(epa_merge, aes(x = dailyPM2.5,  fill = Year1)) +
  geom_histogram(bins=100, color='black',alpha=0.5,position = 'identity') +
  labs(title="Distribution of sites by Daily PM2.5 Concentration in 2002 and 2022", x="Daily PM2.5 Concentration", y= "Count")+
  xlim(0,100)
Warning: Removed 254 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_bar()`).

ggplot(epa_merge, aes(x = dailyPM2.5,  fill = Year1)) +
  geom_histogram(bins=100,position = 'dodge') +
  labs(title="Distribution of sites that reported unhealthy Daily PM2.5 Concentration in 2002 and 2022", x="Daily PM2.5 Concentration", y= "Count")+
  xlim(35,310)
Warning: Removed 73652 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_bar()`).

epa_merge %>%
  group_by(Year) %>%
  summarise(mean = mean(dailyPM2.5,na.rm = TRUE),
            median = median(dailyPM2.5,na.rm = TRUE),
            sd = sd(dailyPM2.5),
            min = min(dailyPM2.5),
            max = max(dailyPM2.5),
            IQR = IQR(dailyPM2.5,na.rm = TRUE))
# A tibble: 2 × 7
   Year  mean median    sd   min   max   IQR
  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1  2002 16.1    12   13.9    0    104.  13.5
2  2022  8.43    6.8  7.64  -6.7  302.   6.6

More measurements were reported in California in 2022 than in 2002, since more sites were built within the 20-year period. The histograms for both years show a positive skew in PM2.5 concentration. The average daily PM2.5 concentration, however, was higher in 2002 (16.12 ug/m^3 LC) than in 2022 (8.43 ug/m^3 LC). The median, standard deviation, and IQR were also higher in 2002. Although 2002 had more particulate matter pollution on average, the highest PM2.5 concentration in either year was reported in July 2022, at 302.5 ug/m^3 LC.

According to the EPA, a 24-hour PM2.5 concentration of 35 ug/m^3 LC and above is considered unhealthy for sensitive groups. The second histogram shows the distribution of PM2.5 concentrations of 35 ug/m^3 LC and above in California. Fewer unhealthy air quality days were reported in 2022 than in 2002, which suggests that air quality has improved over the 20 years.
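Because 2022 has far more observations than 2002 (59756 versus 15976), raw counts above the threshold are hard to compare between years. The share of observations at or above 35 ug/m^3 is a fairer measure. A minimal sketch on an invented data frame `toy` (the values are hypothetical, and `threshold` is just a helper name):

```r
# Hypothetical readings: four in 2002 and five in 2022.
toy <- data.frame(
  Year = c(2002, 2002, 2002, 2002, 2022, 2022, 2022, 2022, 2022),
  dailyPM2.5 = c(12, 40, 36, 20, 5, 8, 50, 7, 9)
)

threshold <- 35
unhealthy <- toy$dailyPM2.5 >= threshold

# Count and share of readings at or above the threshold, per year.
count_by_year <- tapply(unhealthy, toy$Year, sum)
share_by_year <- tapply(unhealthy, toy$Year, mean)

count_by_year
share_by_year
```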

county

ggplot(epa_merge) +
  geom_point(mapping = aes(x = County, y = dailyPM2.5, colour = factor(Year))) +
  scale_color_manual(values=c("slateblue", "skyblue")) +
  labs(x = "County", y = "Daily PM2.5 Concentration (ug/m^3 LC)") +
  theme(axis.text.x = element_text(angle = 90, vjust = .5, size = 5))

epa_merge %>%
  group_by(County,Year) %>%
  summarise(mean = mean(dailyPM2.5,na.rm = TRUE),
            median = median(dailyPM2.5,na.rm = TRUE),
            sd = sd(dailyPM2.5),
            min = min(dailyPM2.5),
            max = max(dailyPM2.5),
            IQR = IQR(dailyPM2.5,na.rm = TRUE))
`summarise()` has grouped output by 'County'. You can override using the
`.groups` argument.
# A tibble: 98 × 8
# Groups:   County [51]
   County        Year  mean median    sd   min   max   IQR
   <chr>        <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Alameda       2002 14.3    10   11.4    1.9  61.6 10.2 
 2 Alameda       2022  8.20    7    4.95  -0.7  35.5  5.8 
 3 Butte         2002 14.8    11.5 11.7    1    88   10.5 
 4 Butte         2022  6.19    4.5  5.79  -0.6  42.8  5.1 
 5 Calaveras     2002  9.9     8    6.50   2    40    6.25
 6 Calaveras     2022  6.04    5    4.10   0    25.9  4.15
 7 Colusa        2002 11.7     9   10.0    1    57    9.5 
 8 Colusa        2022  7.61    6.7  4.76   0.6  37    6.1 
 9 Contra Costa  2002 15.1     9.5 14.5    2    76.7 10.1 
10 Contra Costa  2022  8.25    7.3  4.92   0.9  37.3  5.6 
# ℹ 88 more rows
epa_summary <- epa_merge %>%
  group_by(County,Year) %>%
  summarise(mean = mean(dailyPM2.5,na.rm = TRUE),
            median = median(dailyPM2.5,na.rm = TRUE),
            sd = sd(dailyPM2.5),
            min = min(dailyPM2.5),
            max = max(dailyPM2.5),
            IQR = IQR(dailyPM2.5,na.rm = TRUE))
`summarise()` has grouped output by 'County'. You can override using the
`.groups` argument.
epa_County <- epa_merge %>%
  group_by(County)

ggplot(epa_County, aes(x = factor(Year), y = dailyPM2.5, fill = factor(Year))) +
  geom_boxplot() +
  labs(title = "Box Plot of Daily PM2.5 Concentrations in California Counties (2002 vs 2022)",
       x = "Year",
       y = "Daily PM2.5 Concentration (µg/m³)") +
  scale_fill_manual(values = c("skyblue", "lightgreen")) +
  theme_minimal() +
  theme(legend.position = "none")

Overall, mean daily PM2.5 concentrations were lower in 2022 across California counties. However, 2022 also had many outliers, where daily PM2.5 concentrations were far above the typical range. The highest daily PM2.5 concentrations appeared mostly in counties that were heavily affected by the 2022 wildfires, for example Mariposa, Nevada, Placer, Riverside, Siskiyou, and Trinity counties.
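To see which counties improved and which got worse, the county means can be laid out as a county-by-year matrix and differenced. The sketch below uses an invented data frame `toy` with two hypothetical counties, "A" and "B".

```r
# Hypothetical county-level readings for two years.
toy <- data.frame(
  County = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Year = c(2002, 2002, 2022, 2022, 2002, 2002, 2022, 2022),
  dailyPM2.5 = c(10, 20, 6, 8, 12, 14, 20, 30)
)

# Matrix of mean concentration: rows are counties, columns are years.
county_means <- tapply(toy$dailyPM2.5, list(toy$County, toy$Year), mean)

# Change from 2002 to 2022; negative values mean lower PM2.5 in 2022.
change <- county_means[, "2022"] - county_means[, "2002"]
sort(change)
```

Counties with a positive `change` are the ones where the 2022 mean exceeded the 2002 mean.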

site in LA County

epa_MainStreet<- epa_merge %>%
  filter(Local.Site.Name == "Los Angeles-North Main Street")
ggplot(epa_MainStreet, aes(x = factor(Year), y = dailyPM2.5, fill = factor(Year))) +
  geom_boxplot() +
  labs(title = "Box Plot of Daily PM2.5 Concentrations in Los Angeles - North Main Street (2002 vs 2022)",
       x = "Year",
       y = "Daily PM2.5 Concentration (µg/m³)") +
  scale_fill_manual(values = c("skyblue", "lightgreen")) +
  theme_minimal() +
  theme(legend.position = "none")

epa_MainStreet %>%
  group_by(Year) %>%
  summarise(mean = mean(dailyPM2.5,na.rm = TRUE),
            median = median(dailyPM2.5,na.rm = TRUE),
            sd = sd(dailyPM2.5),
            min = min(dailyPM2.5),
            max = max(dailyPM2.5),
            IQR = IQR(dailyPM2.5,na.rm = TRUE))
# A tibble: 2 × 7
   Year  mean median    sd   min   max   IQR
  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1  2002  22.0   19.3 11.7    3.9  66.3 13   
2  2022  11.6   10.9  4.57   2.4  38    5.98

Looking closer at the Los Angeles-North Main Street station, daily PM2.5 concentrations were lower in 2022 than in 2002. The mean daily PM2.5 was 22.0 ug/m^3 in 2002 and 11.6 ug/m^3 in 2022. The decrease in the median and IQR, as well as the presence of fewer outliers, suggests that overall air quality has improved at this one site in Los Angeles, likely due to regulatory measures or changes in environmental conditions.
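The size of the change at this site can be expressed as a percent change, using the two means from the summary table above (22.0 in 2002 and 11.6 in 2022). The variable names below are just for illustration.

```r
# Site-level means taken from the summary table for this station.
mean_2002 <- 22.0
mean_2022 <- 11.6

# Percent change relative to the 2002 mean.
pct_change <- 100 * (mean_2022 - mean_2002) / mean_2002
round(pct_change, 1)
```

This works out to a drop of roughly 47% in the mean daily PM2.5 at this station.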